9 research outputs found

    Anti-Abuse Protection of Online Social Networks using Machine Learning

    No full text
    Over the last decade, the growing popularity of Online Social Networks has attracted a pervasive presence of social spammers. While this presence started with spam advertising and common scams, recent years have seen it escalate to far more concerning mass-manipulation attempts. This targeted and largely automated abuse of social platforms risks the credibility and usefulness of the information disseminated on these platforms. The social spam detection problem has traditionally been modeled as a supervised problem where the goal is to classify individual social accounts. This common choice is problematic for two reasons. First, the dynamic and adversarial nature of social spam makes the performance achieved by feature-based supervised systems hard to maintain. Second, feature-based modeling of individual social accounts discards the collusive context in which social attacks are increasingly undertaken. Acting synchronously allows spammers to gain greater exposure and efficiently disseminate their content. Thus, even when spammers change their characteristics, they continue to act collusively, inevitably creating links between collusive spamming accounts. This constitutes an unsupervised signal that is relatively easy to maintain and hard to evade. It is therefore beneficial to find a suitable similarity measure that captures this collusive behavior. Accordingly, we propose in this work to cast the social spam detection problem in probabilistic terms using the undirected graphical models framework. Instead of the individual detection paradigm that is commonly used in the literature, we model the classification task as one of joint inference. In this context, accounts are represented as random variables and the dependency between these variables is encoded in a graphical structure. This probabilistic setting makes it possible to model the uncertainty that is inherent to classification systems while simultaneously leveraging the dependency that flows from the similarity induced by the spammers' collusive behavior. We propose two graphical models: the Markov Random Field, with inference performed via Loopy Belief Propagation, and the Conditional Random Field, with a setting that is more adapted to the classification problem, namely the Tree-Reweighted Message Passing algorithm for inference and a loss that minimizes the empirical risk. Both models, evaluated on Twitter, demonstrate an increase in classification performance compared to state-of-the-art supervised classifiers. Compared to the Markov Random Field, the proposed Conditional Random Field framework offers better classification performance and higher robustness to changes in the spammers' input distribution.
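The joint-inference idea in this abstract can be illustrated on a toy pairwise Markov Random Field. The accounts, priors, edges, and the Potts-style homophily potential below are made-up values chosen only to show the mechanics: each account carries a unary prior belief, similarity edges couple labels, and the MAP labeling maximizes the joint score. This is a sketch, not the thesis's actual parameterization; exact enumeration is used here only because the graph is tiny (the thesis uses Loopy Belief Propagation at scale).

```python
from itertools import product

# Toy MRF: nodes are accounts, labels are 0 = legitimate, 1 = spammer.
# Unary priors and the homophily weight w are illustrative values.
unary = {            # prior belief [P(legit), P(spam)] per account
    "a": [0.9, 0.1],
    "b": [0.4, 0.6],
    "c": [0.3, 0.7],
}
edges = [("a", "b"), ("b", "c")]        # similarity (collusion) graph
w = 0.8                                 # same-label affinity

def pairwise(li, lj):
    """Potts-style compatibility: similar accounts prefer equal labels."""
    return w if li == lj else 1.0 - w

def joint_score(labels):
    """Unnormalized joint probability of a full label assignment."""
    score = 1.0
    for node, lab in labels.items():
        score *= unary[node][lab]
    for i, j in edges:
        score *= pairwise(labels[i], labels[j])
    return score

# Exact MAP by enumeration (feasible only on toy graphs).
best = max(
    (dict(zip(unary, combo)) for combo in product([0, 1], repeat=len(unary))),
    key=joint_score,
)
print(best)
```

Note how the strong legitimate prior on account "a" pulls its similar neighbors toward the same label through the pairwise terms; this is the dependency-propagation effect the abstract describes, just in miniature.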

    Protection Anti-Abus de Réseaux Sociaux Numériques par Apprentissage Statistique.

    No full text
    Over the last decade, the growing popularity of Online Social Networks has attracted a pervasive presence of social spammers. While this presence started with spam advertising and common scams, recent years have seen it escalate to far more concerning mass-manipulation attempts. This targeted and largely automated abuse of social platforms risks the credibility and usefulness of the information disseminated on these platforms. The social spam detection problem has traditionally been modeled as a supervised problem where the goal is to classify individual social accounts. This common choice is problematic for two reasons. First, the dynamic and adversarial nature of social spam makes the performance achieved by feature-based supervised systems hard to maintain. Second, feature-based modeling of individual social accounts discards the collusive context in which social attacks are increasingly undertaken. Acting synchronously allows spammers to gain greater exposure and efficiently disseminate their content. Thus, even when spammers change their characteristics, they continue to act collusively, inevitably creating links between collusive spamming accounts. This constitutes an unsupervised signal that is relatively easy to maintain and hard to evade. It is therefore beneficial to find a suitable similarity measure that captures this collusive behavior. Accordingly, we propose in this work to cast the social spam detection problem in probabilistic terms using the undirected graphical models framework. Instead of the individual detection paradigm that is commonly used in the literature, we model the classification task as one of joint inference. In this context, accounts are represented as random variables and the dependency between these variables is encoded in a graphical structure. This probabilistic setting makes it possible to model the uncertainty that is inherent to classification systems while simultaneously leveraging the dependency that flows from the similarity induced by the spammers' collusive behavior. We propose two graphical models: the Markov Random Field, with inference performed via Loopy Belief Propagation, and the Conditional Random Field, with a setting that is more adapted to the classification problem, namely the Tree-Reweighted Message Passing algorithm for inference and a loss that minimizes the empirical risk. Both models, evaluated on Twitter, demonstrate an increase in classification performance compared to state-of-the-art supervised classifiers. Compared to the Markov Random Field, the proposed Conditional Random Field framework offers better classification performance and higher robustness to changes in the spammers' input distribution.

    Supervised Classification of Social Spammers using a Similarity-based Markov Random Field Approach

    No full text
    Social spam has been plaguing online social networks for years. As these sites are where online users spend most of their time, the battle to capture and monetize users' attention is actively fought by both spammers and legitimate site operators. Social spam detection systems have been proposed as early as 2010. They commonly exploit users' content and behavioral characteristics to build supervised classifiers. Yet spam is an evolving concept, and supervised classifiers often become obsolete as the spam community continuously tries to evade detection. In this paper, we use similarity between users to correct evasion-induced errors in the predictions of spam filters. Specifically, we link similar accounts based on their shared applications and build a Markov Random Field model on top of the resulting similarity graph. We use this graphical model in conjunction with traditional supervised classifiers and test the proposed model on a dataset that we recently collected from Twitter. Results show that the proposed model improves the accuracy of classical classifiers by increasing both the precision and the recall of state-of-the-art systems.
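The similarity-linking step described here, connecting accounts that share applications, can be sketched as a bipartite projection. The application names, user IDs, and popularity threshold below are invented for illustration; the paper's actual filtering criteria may differ.

```python
from collections import defaultdict
from itertools import combinations

# (account, source application) pairs; all values are made-up examples.
posts = [
    ("u1", "SpamBlaster"),
    ("u2", "SpamBlaster"),
    ("u3", "SpamBlaster"),
    ("u4", "Twitter for iPhone"),
    ("u5", "Twitter for iPhone"),
    ("u6", "Twitter for iPhone"),
    ("u7", "Twitter for iPhone"),
]

# Invert to application -> accounts, then project onto user-user edges.
by_app = defaultdict(set)
for user, app in posts:
    by_app[app].add(user)

edges = set()
for app, users in by_app.items():
    # Very popular applications carry little collusion signal; a real
    # system would down-weight or skip them (this threshold is arbitrary).
    if len(users) > 3:
        continue
    for u, v in combinations(sorted(users), 2):
        edges.add((u, v))

print(sorted(edges))
```

The resulting edge set links only the three accounts sharing the rare application; a Markov Random Field can then be laid over this graph as the abstract describes.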

    SimilCatch: Enhanced social spammers detection on Twitter using Markov Random Fields

    No full text
    The problem of social spam detection has been traditionally modeled as a supervised classification problem. Despite the initial success of this detection approach, later analysis of proposed systems and detection features has shown that, like email spam, the dynamic and adversarial nature of social spam makes the performance achieved by supervised systems hard to maintain. In this paper, we investigate the possibility of using the output of previously proposed supervised classification systems as a tool for spammer discovery. The hypothesis is that these systems are still highly capable of detecting spammers reliably even when their recall is far from perfect. We then propose to use the output of these classifiers as prior beliefs in a probabilistic graphical model framework. This framework allows beliefs to be propagated to similar social accounts. Basing similarity on a who-connects-to-whom network has been empirically critiqued in recent literature, and we propose here an alternative definition based on a bipartite users-content interaction graph. For evaluation, we build a Markov Random Field on a graph of similar users and compute prior beliefs using a selection of state-of-the-art classifiers. We apply Loopy Belief Propagation to obtain posterior predictions on users. The proposed system is evaluated on a recent Twitter dataset that we collected and manually labeled. Classification results show a significant increase in recall and a maintained precision. This validates that formulating the detection problem with an undirected graphical model framework makes it possible to restore the deteriorated performance of previously proposed statistical classifiers and to effectively mitigate the effect of spam evolution.
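The prior-beliefs-plus-propagation pipeline described above can be sketched with a minimal Loopy Belief Propagation loop. The graph, the classifier-derived priors, and the homophily potential below are toy values, not the paper's; the code simply shows how classifier scores enter as unary beliefs and get revised by messages from similar accounts.

```python
# Labels: index 0 = legitimate, 1 = spammer.
# priors stand in for supervised-classifier scores (illustrative values).
priors = {"u1": [0.2, 0.8], "u2": [0.6, 0.4], "u3": [0.5, 0.5]}
edges = [("u1", "u2"), ("u2", "u3"), ("u1", "u3")]   # graph contains a loop
psi = [[0.7, 0.3], [0.3, 0.7]]                        # homophily potential

neighbors = {n: set() for n in priors}
for a, b in edges:
    neighbors[a].add(b)
    neighbors[b].add(a)

# messages[(i, j)][l] = message from i to j about j taking label l
messages = {(i, j): [1.0, 1.0] for a, b in edges for i, j in ((a, b), (b, a))}

for _ in range(20):                    # fixed iteration count, no damping
    new = {}
    for (i, j) in messages:
        m = []
        for lj in (0, 1):
            total = 0.0
            for li in (0, 1):
                prod = priors[i][li] * psi[li][lj]
                for k in neighbors[i] - {j}:   # all neighbors except target
                    prod *= messages[(k, i)][li]
                total += prod
            m.append(total)
        s = m[0] + m[1]
        new[(i, j)] = [m[0] / s, m[1] / s]     # normalize to avoid underflow
    messages = new

def belief(n):
    """Posterior belief: prior times all incoming messages, normalized."""
    b = [priors[n][l] for l in (0, 1)]
    for k in neighbors[n]:
        for l in (0, 1):
            b[l] *= messages[(k, n)][l]
    s = b[0] + b[1]
    return [b[0] / s, b[1] / s]

print({n: round(belief(n)[1], 3) for n in priors})
```

The posterior for "u1" stays clearly on the spam side of 0.5: the strong classifier prior is reinforced rather than overturned by its neighbors, which is the restore-the-recall effect the abstract reports, in miniature.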

    An IoT-Cloud Based Solution for Real-Time and Batch Processing of Big Data: Application in Healthcare

    No full text
    With the widespread use of the Internet of Things (IoT) today, everything around us seems to generate data. The ever-increasing number of connected things or objects is coupled with a growing volume of data generated at a continually increasing rate. Especially where data is big or there is a need to process it, cloud infrastructures, with their scalability and easy access, are becoming the solution of choice for storage and processing. In the context of healthcare applications, where medical sensors collect health data from patients and send it to the cloud, two issues frequently appear in relation to Big Data. The first issue is real-time analysis, introduced by the increasing velocity at which data is generated, especially from connected devices. This data should be analyzed continuously in real time in order to take appropriate actions regarding the patient's care plan. Moreover, medical data accumulated from different patients over time constitutes an important training dataset that can be used to train machine learning models in order to perform smarter disease prediction and treatment. This gives rise to another issue regarding long-term batch processing of often huge volumes of stored data. To deal with these issues, we propose an IoT-Cloud based framework for real-time and batch processing of Big Data in the healthcare domain. We implement the proposed solution on the Amazon cloud operator, known as Amazon Web Services (AWS), and use a Raspberry Pi as an IoT device to generate data in real time. We test the solution with the specific application of ECG monitoring and abnormality reporting. We analyze the performance of the implemented system in terms of response time by varying the velocity and volume of the analyzed data. We also discuss how the cloud resources should be provisioned in order to guarantee processing performance for both long-term and real-time scenarios. To ensure a good trade-off between cost and processing performance, resource provisioning should be adapted to the exact needs and characteristics of the considered application.
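The dual real-time/batch path described above can be sketched in a few lines: every incoming reading is archived for later batch training, while a real-time check reports abnormal samples immediately. The heart-rate bounds, sample format, and in-memory "store" are illustrative stand-ins for the AWS services the paper uses.

```python
batch_store = []        # stands in for cloud object storage (batch path)
alerts = []             # stands in for the real-time abnormality reports

LOW, HIGH = 50, 120     # bpm bounds; a real system would be patient-specific

def ingest(sample):
    """Handle one ECG-derived reading on both processing paths."""
    batch_store.append(sample)                 # long-term batch analysis
    if not (LOW <= sample["bpm"] <= HIGH):     # real-time abnormality check
        alerts.append(sample)

# Toy stream, as might be emitted by the Raspberry Pi device.
stream = [{"t": 0, "bpm": 72}, {"t": 1, "bpm": 135}, {"t": 2, "bpm": 64}]
for s in stream:
    ingest(s)

print(len(batch_store), len(alerts))
```

All three samples land in the batch store while only the out-of-range reading triggers an alert, mirroring the two processing issues the abstract identifies.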

    Performance/cost analysis of a cloud based solution for big data analytic: Application in intrusion detection

    No full text
    The essential target of 'Big Data' technology is to provide new techniques and tools to assimilate and store large amounts of generated data in a way that allows analyzing and processing it to get insights and predictions, offering new opportunities toward the improvement of our lives in different domains. In this context, 'Big Data' treats two essential issues: the real-time analysis issue, introduced by the increasing velocity at which data is generated, and the long-term analysis issue, introduced by the huge volume of stored data. To deal with these two issues, we propose in this paper a cloud-based solution for big data analytics on the Amazon cloud operator. Our objective is to evaluate the performance of the offered Big Data services with regard to the volume and velocity of the processed data. The dataset we use contains information about "network connections" in approximately 5 million records with 41 features; the solution works as a network intrusion detector. It receives data records in real time from a Raspberry Pi node and predicts whether the connection is bad (malicious intrusion or attack) or good (normal connection). The prediction model was built using logistic regression. We evaluate the cloud resources needed to train the machine learning model (batch processing) and to predict on new streaming data with the trained model in real time (real-time processing). The solution worked very well with high accuracy. The results show that when working with Big Data in the cloud, we are mainly dealing with a cost/performance trade-off: processing performance in terms of response time for both long-term and real-time analysis can always be guaranteed once the cloud resources are well provisioned according to the needs.
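The logistic-regression prediction model mentioned above can be sketched with plain batch gradient descent. The two-feature toy dataset below stands in for the 41-feature connection records; feature names, learning rate, and iteration count are illustrative choices, not the paper's configuration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# (features, label): e.g. [scaled duration, scaled bytes], 1 = attack.
data = [([0.1, 0.2], 0), ([0.2, 0.1], 0), ([0.9, 0.8], 1), ([0.8, 0.9], 1)]

w = [0.0, 0.0]
b = 0.0
lr = 0.5

for _ in range(2000):                        # batch gradient descent
    gw, gb = [0.0, 0.0], 0.0
    for x, y in data:
        p = sigmoid(w[0] * x[0] + w[1] * x[1] + b)
        err = p - y                          # gradient of the log loss
        gw[0] += err * x[0]
        gw[1] += err * x[1]
        gb += err
    w = [w[0] - lr * gw[0] / len(data), w[1] - lr * gw[1] / len(data)]
    b -= lr * gb / len(data)

def predict(x):
    """Probability that a connection record is an attack."""
    return sigmoid(w[0] * x[0] + w[1] * x[1] + b)

print(round(predict([0.15, 0.15]), 2), round(predict([0.85, 0.85]), 2))
```

In the paper's setup, the training loop corresponds to the batch-processing path in the cloud, while `predict` corresponds to scoring each streamed record in real time.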